Machine Learning for Somatic Single Nucleotide Variant Detection in Cell-free Tumor Nucleic acid Sequencing Applications

ABSTRACT

Systems and methods are disclosed to detect single-nucleotide variations (SNVs) from somatic sources in a cell-free biological sample of a subject by generating training data with class labels; in computer memory, generating a machine learning unit comprising one output for each of adenine (A), cytosine (C), guanine (G), and thymine (T) calls; training the machine learning unit; and applying the machine learning unit to detect the SNVs from somatic sources in the cell-free biological sample of the subject, wherein the cell-free biological sample comprises a mixture of nucleic acid molecules from somatic and germline sources.

CROSS-REFERENCE

This application is a Divisional of U.S. patent application Ser. No. 15/255,028, filed Sep. 1, 2016, which claims priority to U.S. Provisional Patent Application No. 62/213,448, filed Sep. 2, 2015, which applications are entirely incorporated herein by reference.

BACKGROUND

Single-nucleotide variation (SNV) detection is a critical step in a typical analysis pipeline for re-sequencing applications. It refers to the detection (or determination) of single-base differences between a newly generated sequence and a reference sequence. Besides SNVs, there are other common types of variations between an individual sample's sequence and a reference sequence. Examples of such variations are: (1) indels (e.g., insertions or deletions), (2) copy number variations (CNVs), which may include changes involving very long stretches (e.g., thousands or even millions of nucleotides), and (3) chromosomal rearrangements, such as gene fusions. Conventionally an indel (or Indel, InDel) is understood as either an insertion or a deletion at a given location, with the plural form indels (or Indels, InDels). Although the detection of these two latter types of variants is generally more difficult than the detection of SNVs, the present disclosure may be applied to these variations also, as will be clear to those skilled in the relevant art.

Variant detection, including SNV detection, indel detection, and SV or CNV detection, follows a mapping or alignment step in the analysis pipeline. Mapping or alignment refers to the operation by which the original sequencing reads are mapped to the reference sequence. Because the sequencing reads are short, and there are many repeated regions in the very long reference sequence (e.g., the human reference genome is ~3 billion nucleotides long), finding the precise position in the reference sequence where a read is mapped to is also challenging. Genome mapping methods, which are known to those skilled in the art, are not discussed here.

One reason SNV detection is difficult when using next-generation sequencing (NGS) approaches is that the error rate produced by conventional NGS technologies (e.g., Illumina technology) is commonly believed to be on the level of 0.1% to 1%, which is zero to one order of magnitude higher than the SNV rate (that is, the proportion of nucleotides that are different between any two individuals, or between one's genome and the reference genome, e.g., ~0.1%). Both SNVs and sequencing errors are reflected as differences between the sequencing data and the reference sequence. In other words, the “noise” (e.g., sequencing error) could be as high as one order of magnitude higher than the “signal” (e.g., real SNVs).

SUMMARY

The present disclosure provides methods and systems for detection of single-nucleotide variations (SNVs) from somatic sources in a cell-free biological sample of a subject, such as in a mixture of nucleic acid molecules from somatic and germline sources.

Systems and methods are disclosed to detect single-nucleotide variations (SNVs) from somatic sources in a cell-free biological sample of a subject by generating training data with class labels; forming a machine learning unit having one output for each of adenine (A), cytosine (C), guanine (G), and thymine (T) base calls, respectively; training the machine learning unit with a training set of biological samples; and applying the machine learning unit to detect the SNVs from somatic sources in the cell-free biological sample, wherein the cell-free biological sample may comprise a mixture of nucleic acid molecules (e.g., deoxyribonucleic acid (DNA)) from somatic and germline sources, e.g., cells comprising somatic mutations and germline DNA.

Advantages of the system may include one or more of the following. The system can handle a large number of input features used for performing the SNV detections. For instance, the GC content information (which is very informative) can be utilized. The system can be highly scalable. The system does not necessarily rely on hard thresholds on the number of molecules, which helps in scaling with variable coverage. The system can make accurate calls even when coverage deviates from its nominal value (higher or lower). The system provides SNV detection optimality; in comparison, there is no guarantee that heuristic methods will result in the optimal detection. The system provides probabilistic quantification with a quality score that can be globally used in a downstream probabilistic (e.g., Bayesian) method.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:

FIG. 1 shows an exemplary process to perform detection of single-nucleotide variations (SNVs) from somatic sources in a cell-free biological sample of a subject, wherein the cell-free biological sample comprises a mixture of nucleic acid molecules from somatic and germline sources.

FIG. 2A and FIG. 2B show an exemplary learning machine unit such as a three-output or four-output neural network.

FIG. 3 shows an exemplary digital sequencing system with the machine learning unit for SNV detection.

FIG. 4 shows an exemplary training data set.

DETAILED DESCRIPTION

While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

Cell-free deoxyribonucleic acid (DNA) from a subject with cancer typically will be a mixture of germline DNA and cancer cell DNA, which contains somatic mutations (hereinafter, “somatic DNA”). When such a mixture is sequenced, base calls at a genomic base position (hereinafter, “base position”) among sequence reads in which the cancer cells have a somatic mutation will be a combination of calls from the germline DNA and calls from the somatic DNA. In addition, there is a chance of sequencing errors. So, for example, 1,000 reads from a cancer patient may produce calls of A=988, T=2, G=1, and C=9. In such a case, a user might call the presence of C at the genomic base position in the sample, and might call it at ~1%. In this way, for each of a plurality of genomic base positions in a sample, the presence of one or more bases at the genomic base position, the relative frequency of one or more bases at the genomic base position, and/or the probability of the presence of each of one or more bases at the genomic base position can all be determined, or called.
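The 1,000-read example above can be turned into relative frequencies with a few lines of Python. This is only an illustrative sketch; the function name `base_frequencies` is an assumption made here and does not come from the disclosure.

```python
def base_frequencies(counts):
    """Convert per-base read counts at one base position into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads cover this base position")
    return {base: n / total for base, n in counts.items()}


# Counts taken from the example in the text: A=988, T=2, G=1, C=9.
counts = {"A": 988, "T": 2, "G": 1, "C": 9}
freqs = base_frequencies(counts)
print(freqs["C"])  # fraction of reads supporting C, roughly 1%
```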

Machine learning methods can be used to generate models that call the presence of a base at a genomic base position in a sample comprising mixed DNA (e.g., germline DNA and somatic DNA) with higher accuracy than a heuristic method and that, optionally, provide a confidence level of the call. Such models can be generated by providing a machine learning unit with training data in which the expected output is known in advance, e.g., an output in which it is known that 99% of the bases at a genomic base position are A and 1% are C.

Such a training set can be provided as follows. Cell-free DNA from a plurality of presumably homogenous normal samples may be sequenced. These samples can be, for example, cell-free DNA from individuals who do not have a condition in which diseased cells comprise somatic mutations, e.g., healthy individuals or non-cancer individuals. This provides a set of sequences in which the base at each genomic base position examined is expected to be the same for all molecules in the sample that are homozygous at that corresponding locus. This can produce, for each sample, a vector indicating, at each genomic base position, the counts of each base at the genomic base position.

The polymorphism rate in the human population is about 0.1% at any genomic base position. Therefore, in any one sample, 0.1% of the base positions are expected to harbor variants, or, in a panel of 160,000 bases, about 160 variants per sample. Put another way, in a set of 1,000 samples, in the worst case, at any particular base position, one might expect to find a single-nucleotide polymorphism (SNP) in about one of the samples.

At this point, the method can proceed with an in vitro approach or an in silico approach. Both approaches involve generating reads from mixtures of samples.

In an in vitro approach, a collection of samples comprising homogenous DNA, e.g., only germline DNA, is provided. The DNA is sequenced to determine genotype at each base position of interest. DNA from the samples is then mixed to provide mixed samples having known relative proportions of DNA from each sample in the mixture. Suppose, for example, one is provided with a set of 100 samples: a1, a2, a3, . . . , a100. A set of mixtures can be produced in which each sample is mixed with each other sample at a ratio of, for example, 99% to 1%. In one example, this can produce a set of 9,900 mixtures with composition as follows: 99% a1×1% a2, 99% a1×1% a3, 99% a1×1% a4, . . . , 99% a1×1% a99, 99% a1×1% a100. Other mixture sets can also be provided. For example, a mixture set could be 99% a100×1% a99, 99% a100×1% a98, 99% a100×1% a97, . . . , 99% a100×1% a2, 99% a100×1% a1. In another example, the percentages can vary, for example, 95% a1×5% a2, 95% a1×5% a3, 95% a1×5% a4, . . . , 95% a1×5% a99, 95% a1×5% a100 (as shown in FIG. 4).

For each of these mixtures, the expected output in terms of base calls at each base position is known. For example, if it is known that at a particular base position, sample a1 has A (which may be the base in the reference genome) and sample a2 has C, the expected output of sequencing the mixture 99% a1×1% a2 is, for 1,000 reads, for A-T-G-C, [990-0-0-10].
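As a hedged illustration of how such an expected output could be computed, the sketch below derives the per-base counts for a two-sample mixture at a single base position. The helper name `expected_tally` and its parameters are hypothetical, and sequencing error is ignored.

```python
BASES = ("A", "T", "G", "C")  # reported in A, T, G, C order


def expected_tally(major_base, minor_base, major_fraction, n_reads):
    """Expected per-base read counts for a two-sample mixture at one position."""
    major_reads = round(major_fraction * n_reads)
    tally = {base: 0 for base in BASES}
    tally[major_base] += major_reads
    tally[minor_base] += n_reads - major_reads
    return [tally[base] for base in BASES]


# Sample a1 carries A, sample a2 carries C, mixed 99% to 1%, 1,000 reads.
tally = expected_tally("A", "C", 0.99, 1000)
print(tally)
```

When both samples carry the same base, all reads fall on that base and the position carries no variant signal.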

A “class label” can be applied to each sample indicating the classification of the sample for any number of input features. For example, the class labels for the set of mixtures above could indicate the identity of variants at various base positions (e.g., “C” in the mixture 99% a1×1% a2, to indicate the sample has the presence of C at a particular base position).

All the mixtures are sequenced, producing a vector including data indicating, for each mixture, the read or molecule count (or %) of each base at each base position tested in each mixture. Other features of the sample can be included in the vector for each mixture, such as GC content at a base position, entropy, detection of reads from both or only one strand, etc. This constitutes the training set.

In an in silico approach, each sample is sequenced to produce sets of sequence reads. “Mixtures” are produced by, for example, combining reads from different samples in prescribed percentages. For example, mixture 99% a1×1% a2 could include 990 reads from sample a1 and 10 reads from sample a2. Ideally, these reads are randomly selected. The training set comprises tally vectors from each in silico mixture, in addition to other selected features. Again, there is an “expected” output for each mixture, and each mixture may be provided with one or more class labels for any number of input features.

The resulting training sets are provided to a machine learning unit, such as a neural network or a support vector machine. Using the training set, the machine learning unit may generate a model to classify the sample according to base identity at one or more base positions. This is also referred to as “calling” a base. The model developed may employ information from any part of a test vector. That is, it may use not only information about tally vectors from the base position in question, but tally vectors from other base positions proximal or distal to the test base position or non-sequence read information included as a feature of the vector.
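A minimal sketch of the in silico mixing step follows, assuming each read is reduced to the base it reports at one base position. The function name `in_silico_mixture`, the fixed seed, and the toy read pools are assumptions for illustration only.

```python
import random
from collections import Counter


def in_silico_mixture(reads_major, reads_minor, n_major, n_minor, seed=0):
    """Randomly draw reads from two samples and tally the bases they report."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    mixed = rng.sample(reads_major, n_major) + rng.sample(reads_minor, n_minor)
    return Counter(mixed)


# Toy read pools: every a1 read reports A, every a2 read reports C.
reads_a1 = ["A"] * 5000
reads_a2 = ["C"] * 5000
mixture_tally = in_silico_mixture(reads_a1, reads_a2, n_major=990, n_minor=10)
print(dict(mixture_tally))
```

With real reads the pools would contain sequencing errors, so the tally would vary slightly from draw to draw.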

FIG. 1 shows an exemplary process to perform detection of single-nucleotide variations (SNVs) from somatic sources, in a mixture of somatic and germline cell-free DNA data. The process may use a multi-layer-perceptron (MLP) neural network shown in FIG. 2 as a machine learning method. However, it must be noted that other machine learning methods such as support-vector machines (SVMs), other neural networks (e.g., RBNN), neuro-fuzzy, and other methods could also be utilized.

Turning now to FIG. 1, an exemplary process is disclosed for detecting SNVs from somatic sources. The SNV calls are made from mixtures of nucleic acid molecules from somatic and germline sources using the process that:

Generates training data with class labels (1)

Determines germline and somatic genotypes (2)

Uses Variant Base to emulate somatic change (3)

Generates training examples from permutation of each pair of normals (4)

Generates a machine learning unit such as a supervised learning machine, an SVM, or a three-output or four-output neural network (one output for each of A, C, G, and T calls) (5)

Generates machine learning data structure (6) considering input features:

-   biological significance in SNV detection,
-   tally vectors for each of the training set vectors, or
-   statistics of a series of normal samples at the base position of interest
-   items (binary or real-valued) that represent existence or the probability of disturbance caused by a repeat structure at the base position of interest

Trains the learning unit (7)

Applies the learning unit to detect the SNVs from somatic sources in the cell-free biological sample of the subject, wherein the cell-free biological sample comprises a mixture of somatic and germline sources (8)

Each of the foregoing steps is detailed next.

In step 1, to enable supervised learning, a series of training data along with class labels are needed. In an embodiment, the training set could be an in silico mixture of one normal cell-free DNA sample in another normal cell-free DNA sample. The mixture can be formulated as a1*x1+a2*x2, where a1 and a2 are the mixture coefficients, and a1+a2=1, and where x1 and x2 are the first and second normal set, respectively. In this embodiment, a1 is much larger than a2. For example, if a1=0.99 and a2=0.01, 1% of x2 is mixed with 99% of x1. Therefore, this scenario could illustrate a case of 1% somatic in 99% germline background.

In step 2, the genotypes of x1 (emulating germline) and x2 (emulating somatic) are found independently, using their pure (pre-mix) data. This can be done via various genotype calling algorithms.

For example, in a genotyping analysis, DNA from a population of several individuals can be analyzed by a set of multiplexed arrays. The data for each multiplexed array may be self-normalized using the information contained in that specific array. This normalization algorithm may adjust for nominal intensity variations observed in the two color channels, background differences between the channels, and possible crosstalk between the dyes. The behavior of each base position may then be modeled using a clustering algorithm that incorporates several biological heuristics on SNP genotyping. In cases where fewer than three clusters are observed (e.g., due to low minor-allele frequency), locations and shapes of the missing clusters may be estimated using neural networks. Depending on the shapes of the clusters and their relative distance to each other, a statistical score may be devised (a Training score). A score such as GenCall Score is designed to mimic evaluations made by a human expert's visual and cognitive systems. In addition, it has been evolved using the genotyping data from top and bottom strands. This score may be combined with several penalty terms (e.g., low intensity, mismatch between existing and predicted clusters) in order to make up the Training score. The Training score, along with the cluster positions and shapes for each SNP, is saved for use by the calling algorithm.

To call genotypes for an individual's DNA, the calling algorithm may take the DNA's intensity values and the information generated by the clustering algorithm; it may then identify to which cluster the data for any specific base position (of the DNA of interest) corresponds. The DNA data may first be normalized (using the same procedure as for the clustering algorithm). The calling operation (classification) may be performed using a Bayesian model. Each call's Call Score is the product of the Training Score and a data-to-model fit score. After scoring all the base positions in the DNA of interest, the application may compute a composite score for that DNA (DNA Score). Subsequently, the Call Score of each base position for this DNA may be further penalized by the DNA Score. The Call Score may not be a probability, but a score, which may be designed to rank and filter out failed genotypes, DNAs, and/or base positions. Call Scores may be averaged among DNAs and among base positions for purposes of evaluating the quality of the genotyping within a particular DNA or base position. Using GC10 and GC50 Scores, a user may choose to fail particularly poorly performing base positions, for instance, by discarding base positions with GC10 of 0.1 or lower. Also, a series of aggregate statistics (e.g., average) of the GC10 or GC50 scores for each DNA may be used to identify low-quality DNAs (for instance, a user may discard DNA samples with average GC10 scores of 0.2 or lower).

In step 3, reference bases (i.e., the bases that get similar calls between x1 and x2) are relatively uninformative. The homozygous and heterozygous calls, however, may be informative. For instance, in a case where x1 is a reference base (e.g., A) and x2 is a variant base (e.g., C), the variant could emulate a somatic change. For instance, suppose the following tally vectors (counts of A, C, G, and T at a particular base position) are available for x1 and x2, respectively: [1000 0 1 2] and [1 1000 3 1]. The mixture tally vector (0.99x1+0.01x2) could resemble the following: [999 10 2 2] (note the stochastic changes in the small values). In this case, A=999 could represent the germline contribution and C=10 could represent the somatic contribution in the mixture. Since, in an embodiment, an objective may be to find the somatic contribution, C would be the class label with the molecular support of 10.
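The labeling rule in this example can be sketched as follows. The sketch assumes, for illustration only, that the base with the highest count is the germline base and that the runner-up is the emulated somatic base; the function name `somatic_label` is hypothetical.

```python
BASES = ("A", "C", "G", "T")


def somatic_label(tally):
    """Return (germline_base, somatic_base, support) from an A, C, G, T tally.

    Assumption: the largest count is germline, the second largest is somatic.
    """
    ranked = sorted(zip(BASES, tally), key=lambda pair: pair[1], reverse=True)
    germline_base = ranked[0][0]
    somatic_base, support = ranked[1]
    return germline_base, somatic_base, support


# Mixture tally vector from the example in the text.
label = somatic_label([999, 10, 2, 2])
print(label)
```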

In step 4, if there are N normals, the permutation of each pair of normals could be used to render a series of training examples (as detailed above). For N normals, there would be N*(N−1) permutations. If N≫1, then this number can be approximated by N². Assuming a panel has M bases, and the rate of polymorphism is 0.1%, there would be M/1000 examples from each pair. Therefore, the total number of examples available for training is T=N*(N−1)*M/1000. For N=30 and M=168,000, T is about 150,000 training patterns for training a classifier. However, this number may be filtered/reduced for various reasons, e.g., excluding heterozygotes or excluding calls with insufficient support. Nevertheless, a number of T of about 100,000 may be easily attainable.
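The count of training examples can be checked with a short calculation. The function name `training_examples` is an assumption; the formula is T=N*(N−1)*M/1000 from the text, with the 0.1% polymorphism rate expressed as one position per thousand.

```python
def training_examples(n_normals, panel_bases, positions_per_thousand=1):
    """T = N * (N - 1) * M / 1000 for a 0.1% polymorphism rate."""
    ordered_pairs = n_normals * (n_normals - 1)  # permutations of pairs
    return ordered_pairs * panel_bases * positions_per_thousand // 1000


t = training_examples(30, 168_000)
print(t)  # close to the "about 150,000" quoted in the text
```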

In general, training data sets can be generated from training samples. These training samples are generated by: (a) preparing a plurality of mixtures, wherein each mixture comprises a first normal cell-free DNA sample in a second normal cell-free DNA sample; and (b) sequencing the cell-free DNA in each mixture of the plurality of mixtures of cell-free DNA. A set of mixtures may be prepared, said mixture comprising a permutation of substantially each pair of first and second normal cell-free DNA samples. Each mixture may comprise various relative concentrations of the first and second normal cell-free DNA samples, e.g., 1% and 99%, 2% and 98%, 3% and 97%, 4% and 96%, 5% and 95%, 6% and 94%, 7% and 93%, 8% and 92%, 9% and 91%, 10% and 90%, 15% and 85%, 20% and 80%, 25% and 75%, 30% and 70%, 35% and 65%, 40% and 60%, 45% and 55%, or 50% and 50%, respectively. Collectively, this set of training examples may be used to generate a training data set to train a machine learning unit.

In step 5, a three-output or four-output neural network (one output for each of A, C, G, and T calls) can be devised. These outputs, given sigmoidal transfer functions of the type logsig in their neurons, will render values in the range [0, 1]. From a theoretical perspective, it is known that if an MLP neural network is trained properly, such outputs should converge to the actual probability values. Note that each of A, C, G, and T can take values in the range [0, 1]; therefore, the caller is indeed a possibilistic caller, which is advantageous relative to methods of probabilistic calls. This is due to the fact that the A, C, G, and T outputs are not forced to sum to 1. This means that a multi-call could be made (for instance, the call could be both A and G), which could be alluding to a heterogeneous tumor source (i.e., each tumor causing a different variant). In addition to each of the outputs generating a single score, the neural network may further combine all outputs into a single SNV detection.
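The following sketch illustrates why independent logsig outputs permit a multi-call. The pre-activation values and the 0.5 call threshold are assumptions chosen for illustration; the disclosure does not specify a threshold.

```python
import math


def logsig(z):
    """Logistic sigmoid, mapping any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def multi_call(pre_activations, threshold=0.5):
    """Score A, C, G, T independently and call every base above the threshold."""
    scores = {base: logsig(z) for base, z in zip("ACGT", pre_activations)}
    calls = [base for base, score in scores.items() if score >= threshold]
    return scores, calls


# Hypothetical output-layer pre-activations for A, C, G, T.
scores, calls = multi_call([2.0, -3.0, 1.5, -4.0])
print(calls, sum(scores.values()))
```

Because each output is squashed on its own, the four scores need not sum to 1, unlike a softmax output layer.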

In step 6, the inputs of the neural network comprise input features that have biological significance in the somatic call (e.g., SNV detection) (or need to be removed from the germline contribution). Since somatic calls (e.g., SNV detection) are based on coverage data mapped to a reference genome, any factor contributing to a change in the coverage of the data could be considered as a useful input feature. An example of such a feature is the GC-content. It is well known that GC-content is one of the strong contextual influencers of the coverage data in NGS. Oftentimes, such a relationship is removed via inverse functions, which are subject to amplifying noise, particularly in the areas of the curve where the slope is low (which results in high slopes in the inverse function). In this invention, the GC-content could be corrected by the implicit relationship that the machine learning unit (e.g., neural network) would find. In addition to the GC-content, other contextual information (much of which, unlike GC-content, does not have a direct and tangible correlation with coverage) could be used, e.g., the entropy of one or more bases proximal to the base position of interest (in a certain radius).
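As one possible way to derive such contextual features, the sketch below computes GC-content and Shannon entropy over a window around a base position. The window radius, the use of Shannon entropy, and all function names are assumptions; the disclosure does not fix a particular definition.

```python
import math
from collections import Counter


def window(sequence, position, radius):
    """Bases within `radius` of a 0-based position, clipped at the sequence ends."""
    return sequence[max(0, position - radius): position + radius + 1]


def gc_content(bases):
    """Fraction of bases in the window that are G or C."""
    return sum(base in "GC" for base in bases) / len(bases)


def shannon_entropy(bases):
    """Shannon entropy (in bits) of the base composition of the window."""
    n = len(bases)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bases).values())


context = window("AACGTT", position=2, radius=1)  # the bases "ACG"
features = (gc_content(context), shannon_entropy(context))
print(context, features)
```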

In alternative 6B, other inputs of the neural network could include the actual tally vectors for each of the training set vectors. For example, the counts of A, C, G, and T at a certain base position could be considered as inputs. These counts can also be broken down into various components; for instance, each of the A, C, G, and T bases could be decomposed into sub-count components, including single molecule support, more than 1 molecule support, Watson/Crick support, etc. Since an objective is to make somatic calls, as an alternative embodiment, the base with the highest tally number (believed to be from the germline source) can be eliminated from the input set. For instance, if the count for A is 999, then only C, G, and T could be input to the neural network and A could be eliminated.

In alternative 6C, to reduce noise, statistics of a series of normal samples at the base position of interest could also be considered as inputs. For instance, mean and standard deviation (or median and interquartile range (IQR)) of the normals for each of the A, C, G, T bases, or for the combination of the bases (sum of A, C, G, T) could be considered as other inputs.

In alternative 6D, other inputs could include features (binary or real-valued) that represent existence or the probability of disturbance caused by a repeat structure at the base position of interest. Examples of such repeats include SINEs (in particular ALUs) and LINEs in the genome.
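A hedged sketch of these per-position background statistics is shown below, using Python's standard `statistics` module. The counts are made-up illustrative values, and `normal_stats` is a hypothetical helper name.

```python
import statistics


def normal_stats(counts):
    """Summary statistics of one base's counts across a series of normal samples."""
    q1, _, q3 = statistics.quantiles(counts, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(counts),
        "stdev": statistics.stdev(counts),
        "median": statistics.median(counts),
        "iqr": q3 - q1,
    }


# Hypothetical counts of C observed at one base position in eight normals.
c_counts_in_normals = [0, 1, 0, 2, 1, 0, 1, 3]
stats = normal_stats(c_counts_in_normals)
print(stats)
```

The median and IQR are less sensitive than the mean and standard deviation to a single outlying normal.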

Artificial neural networks (NNets) mimic networks of “neurons” based on the neural structure of the brain. They process records one at a time, or in a batch mode, and “learn” by comparing their classification of the record (which, at the outset, is largely arbitrary) with the known actual classification of the record. In MLP-NNets, the errors from the initial classification of the first record are fed back into the network and used to modify the network's algorithm the second time around, and so on for many iterations.

FIG. 2A shows an exemplary MLP with input layer neurons 14, hidden layer neurons 16, and output layer neurons 18. A minimum of one hidden layer is recommended if the neural network is to learn a general pattern. This layer could be linear or sigmoidal. More layers could also be considered.

Turning now to step 7, the process trains the neural network. As shown in FIG. 2B, a neuron in an artificial neural network is a set of input values (xi) and associated weights (wi) and a function (g) that sums the weighted inputs and maps the result to an output (y). A bias (constant term) is also provided to each neuron. Neurons are organized into layers. The input layer is composed not of full neurons, but rather consists simply of the values in a data record that constitute inputs to the next layer of neurons. The next layer is called a hidden layer; there may be several hidden layers. The final layer is the output layer, where, in some cases, there may be one node for each class. A single sweep forward through the network results in the assignment of a value to each output node, and the record is assigned to whichever class's node had the highest value.
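A single neuron of this kind, and the highest-value class assignment, can be sketched as below. The choice of the logistic sigmoid for g and the numeric weights are assumptions for illustration.

```python
import math


def logsig(z):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, g=logsig):
    """y = g(sum_i w_i * x_i + bias)."""
    return g(sum(w * x for w, x in zip(weights, inputs)) + bias)


def classify(output_values, classes=("A", "C", "G", "T")):
    """Assign the record to the class whose output node has the highest value."""
    best = max(range(len(output_values)), key=lambda i: output_values[i])
    return classes[best]


y = neuron([1.0, 0.5], [0.2, -0.4], bias=0.1)
print(y, classify([0.1, 0.8, 0.3, 0.2]))
```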

In the training phase, the correct class for each record is known (this is termed supervised training), and the output nodes can therefore be assigned “correct” values: “1” for the node corresponding to the correct class, and “0” for the others. (In practice, it has been found better to use values of 0.9 and 0.1, respectively.) It is thus possible to compare the network's calculated values for the output nodes to these “correct” values, and to calculate an error term for each node (the “Delta” rule). These error terms are then used to adjust the weights in the hidden layers so that, ideally, at each successive iteration, the output values will be closer to, and eventually converge to, the “correct” values.
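The target encoding and one Delta-rule update for a single sigmoidal output node can be sketched as follows. The learning rate, the starting weights, and the function names are illustrative assumptions, not values from the disclosure.

```python
import math


def logsig(z):
    return 1.0 / (1.0 + math.exp(-z))


def targets(correct_class, classes="ACGT", hi=0.9, lo=0.1):
    """Target values: 0.9 for the correct class node and 0.1 for the others."""
    return [hi if c == correct_class else lo for c in classes]


def delta_rule_step(weights, bias, x, target, lr=0.5):
    """One Delta-rule update of a single sigmoidal output node."""
    y = logsig(sum(w * xi for w, xi in zip(weights, x)) + bias)
    delta = (target - y) * y * (1.0 - y)  # error times the logsig derivative
    new_weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
    return new_weights, bias + lr * delta, y


x = [1.0, 0.5]
w1, b1, y_before = delta_rule_step([0.0, 0.0], 0.0, x, target=0.9)
_, _, y_after = delta_rule_step(w1, b1, x, target=0.9)
print(targets("C"), y_before, y_after)
```

After one update the node's output moves from 0.5 toward the 0.9 target.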

The neural network uses an iterative learning process in which data cases (rows) are presented to the network one at a time, and the weights associated with the input values are adjusted each time.

After all cases are presented, the process often starts over again. During this learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of input samples. Neural network learning is also referred to as “connectionist learning,” due to connections between the units. Advantages of neural networks include their high tolerance to noisy data, as well as their ability to classify patterns on which they have not been trained. One neural network algorithm is the back-propagation algorithm, such as Levenberg-Marquardt. Once a network has been structured for a particular application, that network is ready to be trained. To start this process, the initial weights are chosen randomly. Then the training, or learning, begins.

The network processes the records in the training data one at a time, using the weights and functions in the hidden layers, then compares the resulting outputs against the desired outputs. Errors are then propagated back through the system, causing the system to adjust the weights for application to the next record to be processed. This process occurs over and over as the weights are continually tweaked. During the training of a network the same set of data is processed many times as the connection weights are continually refined.

In an embodiment with Feedforward, Back-Propagation neural networks, the training process uses some variant of the Delta Rule, which starts with the calculated difference between the actual outputs and the desired outputs. Using this error, connection weights are adjusted in proportion to the error times a scaling factor for global accuracy. Doing this for an individual node means that the inputs, the output, and the desired output all have to be present at the same processing element. The system determines which input contributed the most to an incorrect output and how that element should be changed to correct the error. An inactive node would not contribute to the error and would have no need to change its weights. To solve this problem, training inputs are applied to the input layer of the network, and desired outputs are compared at the output layer. During the learning process, a forward sweep is made through the network, and the output of each element is computed layer by layer. The difference between the output of the final layer and the desired output is back-propagated to the previous layer(s), usually modified by the derivative of the transfer function, and the connection weights are normally adjusted using the Delta Rule. This process proceeds for the previous layer(s) until the input layer is reached.
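The forward sweep and backward pass described above can be sketched for a network with one hidden layer. This is a minimal plain-gradient sketch, not the training procedure of the disclosure: the layer sizes, learning rate, random initialization, and input values are all assumptions, and it does not implement Levenberg-Marquardt.

```python
import math
import random


def logsig(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward sweep: compute hidden and output activations layer by layer."""
    h = [logsig(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = [logsig(sum(w * hj for w, hj in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, y


def backprop_step(x, t, W1, b1, W2, b2, lr=0.5):
    """One in-place update; returns the squared error before the update."""
    h, y = forward(x, W1, b1, W2, b2)
    # Output deltas: error modified by the derivative of the transfer function.
    d_out = [(tk - yk) * yk * (1 - yk) for tk, yk in zip(t, y)]
    # Hidden deltas: output deltas propagated back through the old W2.
    d_hid = [hj * (1 - hj) * sum(d_out[k] * W2[k][j] for k in range(len(W2)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += lr * d_out[k] * h[j]
        b2[k] += lr * d_out[k]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += lr * d_hid[j] * x[i]
        b1[j] += lr * d_hid[j]
    return sum((tk - yk) ** 2 for tk, yk in zip(t, y))


rng = random.Random(0)  # initial weights are chosen randomly
n_in, n_hid, n_out = 3, 4, 4
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hid)] for _ in range(n_out)]
b1, b2 = [0.0] * n_hid, [0.0] * n_out

x = [0.01, 0.002, 0.002]       # hypothetical scaled C, G, T fractions
t = [0.1, 0.9, 0.1, 0.1]       # targets for A, C, G, T (correct class: C)
errors = [backprop_step(x, t, W1, b1, W2, b2) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on the same record shrinks the squared error, which mirrors the repeated presentation of the training data described in the text.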

Finally, the process applies the neural network to perform detection of SNVs from somatic sources, in a mixture of somatic and germline cell-free DNA data (8). This can be done by applying the input values in a feed-forward manner through the neural network to arrive at the SNV detections.

In an embodiment, the training step of the machine learning unit on the training data set may generate one or more classification models for applying to a test sample. These classification models may be applied to a test sample to (1) detect the presence of a base at each of a plurality of genomic base positions in a test sample, (2) call a relative frequency of each of one or more bases at each of a plurality of genomic base positions in the test sample, and/or (3) call a probability of the presence of each of one or more bases at each of a plurality of genomic base positions in the test sample.

Referring to FIG. 3, an illustrative embodiment of a digital sequencer and neural network system is shown. The system receives blood samples 303 from patients with cancer, among other diseases. Cancer tumors continually shed their unique genomic material (the root cause of most cancers) into the bloodstream. As the telltale genomic “signals” are so weak, next-generation sequencers can only detect such signals sporadically or in patients with terminally high tumor burden. The error rates and bias of such sequencers can be orders of magnitude higher than what is required to reliably detect de novo genomic alterations associated with cancer. To address this issue, a digital sequencer 310 is used. The digital sequencer may be a nucleic acid sequencer, such as a DNA sequencer. The DNA sequencer may be a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The order of the DNA bases may be reported as a text string, called a read. Some DNA signals may originate from fluorochromes attached to nucleotides. In an embodiment, the four bases (and corresponding input parameters of the machine learning unit) may be adenine (A), guanine (G), cytosine (C), and uracil (U), corresponding to ribonucleic acid (RNA) sequencing.

In an embodiment, digital sequencers from Guardant Health can be used, which reduce the noise and distortion generated by next-generation sequencing to almost zero. The output of sequencer 310 is digitally processed in unit 320 and stored in a data center with database(s) 330. While 310 illustrates a Sanger sequencing machine, other sequencers such as the Illumina HiSeq 2500 can be used. Also, the database 330 is optional, as the data from unit 320 can flow directly to a deep analyzer 340. The deep analyzer 340 can retrieve information from the database 330 and derive high-level information with biomedical meaning from all available data, e.g., for understanding associations between genomic variations (e.g., SNPs and CNVs) and clinical phenotypes, causal relationships between those variations and phenotypes, or functional pathways in response to evolutionary, environmental, and physiological changes, among others. MapReduce-style parallel processing can be used for improved performance. For example, alignment, quality score recalibration, and variation discovery algorithms can be parallelized using the open-source Hadoop system.
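The MapReduce pattern mentioned above can be sketched in a single process as follows. This is not Hadoop code; the read representation (a start position plus a base string), the function names, and the example reads are assumptions chosen only to show how a per-read map step and an associative reduce step combine into per-position base counts.

```python
from collections import Counter
from functools import reduce

def map_read(read):
    """Map step: emit a count of 1 for every (position, base) in one read."""
    start, sequence = read
    return Counter((start + offset, base)
                   for offset, base in enumerate(sequence))

def reduce_counts(left, right):
    """Reduce step: merge two partial count tables by summing their values."""
    return left + right

# Hypothetical aligned reads as (start position, bases) pairs.
reads = [(100, "ACGT"), (101, "CGTA"), (100, "ACTT")]

partials = map(map_read, reads)          # each read could be mapped on a different worker
pileup = reduce(reduce_counts, partials, Counter())
```

Because the reduce step is associative, partial tables can be merged in any grouping, which is what allows the work to be distributed across nodes.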

FIG. 4 shows an exemplary training data set. The training set comprises values for each of 1,000 different mixtures. Values for each mixture include class labels, which may identify the presence of sequences in the mixture having named bases at particular base positions, and/or tally vectors for counts of bases at each base position in each mixture.
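One way to picture a single entry of such a training set is sketched below. The class name `MixtureExample`, the dictionary layout, and the example positions and bases are hypothetical; the sketch only shows a tally vector and a class label vector stored per base position for one mixture.

```python
from dataclasses import dataclass

BASES = "ACGT"

@dataclass
class MixtureExample:
    tallies: dict  # position -> [count_A, count_C, count_G, count_T]
    labels: dict   # position -> [1 if base is truly present else 0, per base]

def tally(observed_bases):
    """Count how many reads show each base at one position."""
    return [observed_bases.count(b) for b in BASES]

def label(known_bases):
    """Class label vector from the bases known to be present in the mixture."""
    return [1 if b in known_bases else 0 for b in BASES]

def build_example(pileup, truth):
    """Assemble tallies and labels for every position of one mixture."""
    return MixtureExample(
        tallies={pos: tally(bases) for pos, bases in pileup.items()},
        labels={pos: label(truth[pos]) for pos in pileup},
    )

# Hypothetical observations: bases read at two positions of one mixture.
pileup = {1000: "A" * 9 + "G", 1001: "C" * 10}
# Hypothetical ground truth, known because the mixture components are known.
truth = {1000: {"A", "G"}, 1001: {"C"}}

example = build_example(pileup, truth)
```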

Instead of a high-volume DNA sequencer, the system can use a handheld DNA sequencer or a desktop DNA sequencer. The DNA sequencer can apply Gilbert's sequencing method based on chemical modification of DNA followed by cleavage at specific bases, or it can apply Sanger's technique, which is based on dideoxynucleotide chain termination. The Sanger method became popular due to its increased efficiency and low radioactivity. The DNA sequencer can use techniques that do not require DNA amplification (polymerase chain reaction, PCR), which speeds up the sample preparation before sequencing and reduces errors. In addition, sequencing data is collected from the reactions caused by the addition of nucleotides in the complementary strand in real time. For example, the DNA sequencers can utilize a method called Single-molecule real-time (SMRT), where sequencing data is produced by light (captured by a camera) emitted when a nucleotide is added to the complementary strand by enzymes containing fluorescent dyes. Alternatively, the DNA sequencers can use electronic systems based on nanopore sensing technologies.

Data is sent by the DNA sequencers over a direct connection or over the internet to a computer for processing. The data processing aspects of the system can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The database 300 can also communicate with a cloud-based system such as Amazon's AWS system. One embodiment uses an elastic load balancer 410 communicating with an auto-scaling instance group. The group has a plurality of security group members 420 with an instance 430 and EBS storage 440. In case of digital sequence processing or deep genetic analyzing/processing overload, the load balancer 410 can automatically start additional instances to handle the load.

The cloud system 400 can be used for backup and instant recovery of an entire data center to a virtual computing environment. The system can also provide backup and restoration of all data center components including, without limitation, physical machines, virtual machines, routers, networks, subnetworks, switches, firewall, directory lookup, DNS, DHCP, and internet access. In one implementation, a backup of the database 300 and supporting local computers can be created, and configuration information of each working computer may be saved together with each backup image, supplemental to the image. The configuration information includes persistent and volatile state. For example, each backup image is created as a snapshot of the corresponding computer. The snapshot comprises an application-consistent image as of a specific point in time of primary storage of the computer. Information representing the state of the network connections of the computer may be saved together with the configuration information saved with the backup image, and the backup image may be loaded as a cloud computing node. Storage facilities from the source database 300 may be recreated using software-defined storage in a public or private cloud. Storage facilities may include an Internet Small Computer Systems Interface (iSCSI) storage, Fibre Channel (FC) storage, or Network Attached Storage (NAS). Storage in the cloud may include, for example, Amazon Simple Storage Service (S3), AWS Storage Gateway, or Amazon Elastic Block Store (EBS).

In some embodiments, a metadata collection agent collects information regarding the components in a data center. In some embodiments, the metadata collection agent is resident on each device of the data center. In other embodiments, the metadata collection agent is resident on a node operatively connected to the data center via a computer network, and operable to collect metadata regarding each device of the data center. In some embodiments, those components of a data center having computer readable storage are automatically backed up to computer readable backup media. Computer readable backup media may include hard disk drives (HDD), solid-state drives (SSD), tape, compact disk (CD), digital video (or versatile) disk (DVD), Flash, diskette, EPROM, or other optical or magnetic storage media known in the art. In some embodiments, all information gathered during the backup process is sent to a selected destination.

In some embodiments, the information is sent to a destination by transport of a computer readable backup medium. In other embodiments, the information is replicated via a computer network to a disaster recovery (DR) site or to a public cloud.

Information regarding each physical and virtual machine in the data center is collected. Such information may vary based on the type of device in question, and may include: network configuration; disk volumes; application state; and operating system state. Network configuration information may include MAC addresses, IP addresses, open ports, and network topology. Disk volume information may include capacity, logical layout, partition maps, metadata regarding contents, as well as physical characteristics.

In some embodiments, the collection of information may include determining what components are required for each virtual or physical machine in the data center to operate. Examples of required components include: network routers; network firewalls; internet access; directory lookup (e.g., AD, LDAP), single sign-on, DHCP, DNS; iSCSI storage devices; FC storage devices; and NAS file servers (e.g., NFS or CIFS). Determining the required components may include collecting application specific information for each machine in the data center, and determining application dependencies of each application. Determining required components may also include analyzing network topology information to determine connected or accessible devices.
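Determining required components from dependency information can be sketched as a graph traversal. The dependency map, the component names, and the function `required_components` below are hypothetical; the sketch assumes that each machine or component lists the components it depends on directly.

```python
def required_components(machine, depends_on):
    """Collect every component reachable from a machine's direct dependencies."""
    required = set()
    stack = list(depends_on.get(machine, ()))
    while stack:
        component = stack.pop()
        if component not in required:
            required.add(component)
            # Follow the dependencies of this component as well.
            stack.extend(depends_on.get(component, ()))
    return required

# Hypothetical direct dependencies gathered by a metadata collection agent.
depends_on = {
    "analysis-server": ["nas-file-server", "ldap"],
    "nas-file-server": ["dns"],
    "ldap": ["dns", "dhcp"],
}

needed = required_components("analysis-server", depends_on)
```

Tracking visited components in a set also keeps the traversal from looping if two components depend on each other.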

The collected information and the backup data may be used to recreate the source physical or virtual data center in a substitute data center. In some embodiments, the substitute data center may be software-defined to emulate the source data center. In some embodiments, the substitute data center comprises a plurality of physical machines that are dynamically reconfigured to conform to the source data center configuration. In other embodiments, the substitute data center may comprise a plurality of virtual machines that are dynamically configured to conform to the source data center configuration. In yet other embodiments, the substitute data center may comprise a plurality of cloud resources. The substitute data center may emulate, or provide the functional equivalent of, the source data center. In some embodiments, the substitute data center may provide a complete substitute for the source data center. In other embodiments, the substitute data center may provide only a selected subset of functionality of the source data center. For example, where a source data center has computation, network, and storage aspects, a subset of this functionality may be selected to be performed by the substitute data center. In some embodiments, multiple substitute data centers may each substitute for aspects of the source data center.

The system can establish communications directly with a medical practice/healthcare provider (treating professional) and/or a patient/subject through communication links. The system can also receive information from other sources such as a medical laboratory, diagnostic laboratory, medical facility, medical practice, point-of-care testing device, or any other remote data site capable of generating subject clinical information. Subject clinical information includes, but is not limited to, laboratory test data, X-ray data, examination, and diagnosis. The healthcare provider or practice may include medical services providers, such as doctors, nurses, home health aides, technicians and physician assistants, and the practice may be any medical care facility staffed with healthcare providers. In certain instances, the healthcare provider/practice may also be a remote data site. In a cancer treatment embodiment, the subject may be afflicted with cancer, among others.

Other clinical information for a cancer subject may include the results of laboratory tests, imaging, or medical procedures directed towards the specific cancer that one of ordinary skill in the art can readily identify. The list of appropriate sources of clinical information for cancer may include, but is not limited to: CT scan, MRI scan, ultrasound scan, bone scan, PET scan, bone marrow test, barium X-ray, endoscopy, lymphangiogram, IVU (intravenous urogram) or IVP (IV pyelogram), lumbar puncture, cystoscopy, immunological tests (anti-malignin antibody screen), and cancer marker tests.

The subject's clinical information may be obtained from the lab manually or automatically. For simplicity of the system, the information is obtained automatically at predetermined or regular time intervals. A regular time interval refers to a time interval at which the collection of the laboratory data is carried out automatically by the methods and systems described herein based on a measurement of time such as hours, days, weeks, months, years, etc. In an embodiment of the invention, the collection of data and processing may be carried out at least once a day. In another embodiment, the transfer and collection of data may be carried out once every month, biweekly, or once a week, or once every couple of days.

Alternatively, the retrieval of information may be carried out at predetermined but not regular time intervals. For instance, a first retrieval step may occur after one week, and a second retrieval step may occur after one month. The transfer and collection of data can be customized according to the nature of the disorder that is being managed and the frequency of required testing and medical examinations of the subjects.
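The two retrieval patterns described above, regular intervals and predetermined but irregular intervals, can be sketched as simple schedule generators. The function names, the start date, the treatment of "one month" as 30 days, and the choice to measure both irregular offsets from the same start are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def regular_schedule(start, interval, count):
    """Retrieval times spaced by a fixed interval after the start."""
    return [start + interval * i for i in range(1, count + 1)]

def predetermined_schedule(start, offsets):
    """Retrieval times at predetermined, possibly uneven, offsets from the start."""
    return [start + offset for offset in offsets]

# Hypothetical start of monitoring for a subject.
start = datetime(2016, 9, 1)

# Regular: once a day for three days.
daily = regular_schedule(start, timedelta(days=1), 3)

# Irregular: after one week, then after one month (assumed to be 30 days).
irregular = predetermined_schedule(
    start, [timedelta(weeks=1), timedelta(days=30)]
)
```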

The computer system can include a set of instructions that can be executed to cause the computer system to perform any one or more of the methods or computer-based functions disclosed herein. For example, the computer system may include instructions that are executable to perform the methods discussed with respect to FIGS. 1, 2A, and 2B. In particular embodiments, the computer system may include instructions to implement the application of a training algorithm to train an artificial neural network or implement operating an artificial neural network in a feed-forward manner. In particular embodiments, the computer system may operate in conjunction with other hardware that is designed to perform methods discussed above. The computer system may be connected to other computer systems or peripheral devices via a network. Additionally, the computer system may include or be included within other computing devices. The computer system may include a processor, e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both. Moreover, the computer system can include a main memory and a static memory that can communicate with each other via a bus. As shown, the computer system may further include a video display unit, such as a liquid crystal display (LCD), a projection television display, a flat panel display, a plasma display, or a solid state display. Additionally, the computer system may include an input device, such as a remote control device having a wireless keypad, a keyboard, a microphone coupled to a speech recognition engine, a camera such as a video camera or still camera, or a cursor control device, such as a mouse device. The computer system can also include a disk drive unit, a signal generation device, such as a speaker, and a network interface device. The network interface enables the computer system to communicate with other systems via a network.

In a particular embodiment, the disk drive unit may include a non-transitory computer-readable medium in which one or more sets of instructions, e.g., software, can be embedded. For example, instructions for applying a training algorithm to an artificial neural network or instructions for operating an artificial neural network in a feed-forward manner can be embedded in the computer-readable medium.

Further, the instructions may embody one or more of the methods or logic as described herein. In a particular embodiment, the instructions may reside completely, or at least partially, within the main memory, the static memory, and/or within the processor during execution by the computer system. The main memory and the processor also may include computer-readable media.

In an alternative embodiment, dedicated hardware implementations, such as application specific integrated circuits, programmable logic arrays and other hardware devices, can be constructed to implement one or more of the methods described herein. Applications that may include the apparatus and systems of various embodiments can broadly include a variety of electronic and computer systems. One or more embodiments described herein may implement functions using two or more specific interconnected hardware modules or devices with related control and data signals that can be communicated between and through the modules, or as portions of an application-specific integrated circuit. Accordingly, the present system encompasses software, firmware, and hardware implementations, or combinations thereof.

While the computer-readable medium is shown to be a single medium, the term “computer-readable medium” includes a single medium or multiple media, such as a centralized or distributed database, and/or associated caches and servers that store one or more sets of instructions. The term “computer-readable medium” shall also include any medium that is capable of storing or encoding a set of instructions for execution by a processor or that cause a computer system to perform any one or more of the methods or operations disclosed herein.

In a particular non-limiting, exemplary embodiment, the computer-readable medium can include a solid-state memory such as a memory card or other package that houses one or more nonvolatile read-only memories. Further, the computer-readable medium can be a random access memory (RAM) or other volatile re-writable memory. Additionally, the computer-readable medium can include a magneto-optical or optical medium, such as a disk or tapes or other storage device to capture carrier wave signals, such as a signal communicated over a transmission medium. Accordingly, the disclosure is considered to include any one or more of a computer-readable medium or other equivalents and successor media, in which data or instructions may be stored.

It will be apparent from the foregoing embodiments that many other modifications, variants, and embodiments are possible. For example, although the above described embodiments use computers or digital signal processing devices to emulate a plurality of neuron units, in another embodiment of the invention, the network is implemented by a custom VLSI circuit comprising a rectangular array of neurons on a substrate, each neuron unit comprising the structures described in relation to FIG. 6. This embodiment is suitable for very high speed operation, since calculations of all neurons in a given layer are performed in parallel, so that the total processing time required to execute the neural network scales with the number of layers, rather than with the total number of neurons as in the embodiments described above. In another embodiment, the neural network is implemented as a sampled analogue circuit, using analogue shift register devices (such as charge coupled devices (CCDs) or bucket brigade devices (BBDs)) as the serial to parallel converters, analogue multipliers and accumulators, and an analogue function generator for the non-linear function. Multiplying digital to analogue converters could be used as the weight multipliers, using the downloaded digital weight values to generate an analogue signal to an analogue input.
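The scaling argument for the layer-parallel VLSI embodiment can be illustrated with an idealized step count. The sketch assumes that evaluating one neuron costs one time step and ignores communication overhead; the layer sizes and function names are hypothetical.

```python
def serial_steps(layer_sizes):
    """Emulated network: neurons are evaluated one after another."""
    return sum(layer_sizes)

def parallel_steps(layer_sizes):
    """Layer-parallel hardware: all neurons of a layer are evaluated at once."""
    return len(layer_sizes)

# Hypothetical network with two hidden layers and four outputs (A, C, G, T).
layers = [64, 32, 4]

emulated = serial_steps(layers)    # grows with the total number of neurons
hardware = parallel_steps(layers)  # grows with the number of layers only
```

Under these simplifying assumptions, widening a layer raises the emulated step count but leaves the layer-parallel count unchanged.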

The illustrations of the embodiments described herein are intended to provide a general understanding of the structure of the various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.

One or more embodiments of the disclosure may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any particular invention or inventive concept. Moreover, although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

What is claimed is:
1. A method, comprising: (a) providing a training data set comprising, for each mixture in a plurality of mixtures, wherein each mixture in the plurality comprises polynucleotides from a plurality of different subjects, values indicating: (i) a quantitative measure of each of a plurality of bases at each of a plurality of genomic base positions from sequence reads of a plurality of polynucleotides in the mixture, and (ii) a plurality of class labels, each class label classifying the mixture as having one or more particular bases at a particular genomic base position; and (b) training a machine learning unit on the training data set to generate one or more classification models for detecting a presence of a base at each of a plurality of genomic base positions in a test sample.
2. The method of claim 1, wherein the plurality of mixtures is provided by combining polynucleotides from a plurality of samples from different subjects in predetermined amounts.
3. The method of claim 1, wherein the one or more classification models further call a relative frequency of each of one or more bases at each of a plurality of genomic base positions in the test sample.
4. The method of claim 1, wherein the one or more classification models further call a probability of the presence of each of one or more bases at each of a plurality of genomic base positions in the test sample.
5. The method of claim 1, wherein the machine learning unit comprises a four-output neural network, a three-output neural network, a support vector machine (SVM), or another supervised learning machine.
6. The method of claim 1, further comprising: (c) providing a test data set comprising, for a test sample, values indicating a quantitative measure of each of a plurality of bases at each of a plurality of genomic base positions for sequence reads of a plurality of polynucleotides in the test sample; and (d) using a classification model of (b) to call the presence of a base at a plurality of the genomic base positions.
7. A training set, comprising a plurality of mixtures of cell-free DNA, wherein each mixture comprises a first normal cell-free DNA sample in a second normal cell-free DNA sample in various predetermined permutations and relative concentrations.